feat(base): per-token `is_content` mask for body/scaffold attribution by snimu · Pull Request #53 · PrimeIntellect-ai/renderers

snimu · 2026-05-19T12:10:58Z

Summary

Adds is_content: list[bool] to RenderedTokens — a per-token signal that generalises sampled_mask across all roles: True iff the token came from message-body bytes (caller-provided content / tool_calls / reasoning_content, or the model's sampled emission for assistant), False iff template scaffolding (role tags, closers when not sampled, separators, tool-response wraps, tools-header block, generation prompt).
Adds build_training_sample(..., content_sft_roles={"tool"}) so a single render produces a loss mask that combines RL on assistant tokens with SFT on tool response bodies — without supervising the surrounding <|tool_response> / role-tag specials that would interrupt a real rollout.
Wires the mask through every hand-coded renderer (13 of them). Token IDs are byte-identical with apply_chat_template.
renderers.client.generate() returns the renderer's per-token attribution as prompt_attribution: RenderedTokens, so downstream consumers (verifiers RendererClient → prime-rl) carry the body/scaffold cut to the trainer without re-rendering.

Motivation

For RL the policy loss applies only to tokens the model emitted. A useful auxiliary objective is SFT on tool response bodies — supervise the model to anticipate what tools return, without supervising the wrap. If the model learns to emit <|tool_response> itself, it can derail a rollout by short-circuiting the harness.

sampled_mask answers "would the model emit this?", which is the right cut for assistant tokens but is uniformly False on non-assistant roles. There is no way to ask "which tokens came from message-body bytes" on tool / user / system messages using sampled_mask alone.

is_content is that signal. For a tool message wrapped as <|im_start|>user\n<|tool_response>\n{body}\n<|tool_response_end|><|im_end|>\n, is_content is True only on the {body} tokens — never on the <|tool_response> specials or the inter-section newlines.

By construction is_content == sampled_mask over every assistant-attributed token; on every other role sampled_mask is uniformly False and is_content carries information sampled_mask cannot. It follows that is_content is a superset of sampled_mask everywhere (every sampled token is also a content token) and never contradicts it.
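As a toy illustration of the relationship described above, the following sketch hand-writes the two masks for one tool message followed by one assistant turn. The token strings, the per-token role list, and the choice to treat the assistant's closing `<|im_end|>` as sampled are assumptions made for the example, not output from any renderer.

```python
# Hand-written toy data: NOT produced by a real renderer or tokenizer.
tokens = [
    "<|im_start|>", "user", "\n", "<|tool_response>", "\n",
    "42",                                   # the tool body
    "\n", "<|tool_response_end|>", "<|im_end|>", "\n",
    "<|im_start|>", "assistant", "\n",      # generation prompt (scaffold)
    "The", " answer", "<|im_end|>",         # assumed to be sampled by the model
]
roles = ["tool"] * 10 + ["assistant"] * 6

sampled_mask = [False] * 10 + [False, False, False, True, True, True]
is_content = [False] * 5 + [True] + [False] * 4 + [False, False, False, True, True, True]

# Invariant 1: the two masks agree on every assistant-attributed token.
assistant_agrees = all(
    c == s for c, s, r in zip(is_content, sampled_mask, roles) if r == "assistant"
)

# Invariant 2: every sampled token is also a content token (superset).
superset = all(c for c, s in zip(is_content, sampled_mask) if s)

# What sampled_mask alone cannot answer: which tool tokens are body?
tool_body = [t for t, c, r in zip(tokens, is_content, roles) if c and r == "tool"]
```

Here `tool_body` contains only the body token, while `sampled_mask` is uniformly False over the same ten tool tokens.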

API

On RenderedTokens:

is_content: list[bool] — same length / empty policy as sampled_mask. Empty means the renderer opts out (DefaultRenderer leaves it empty for the same reason it leaves sampled_mask empty: Jinja is opaque).
content_token_spans_by_role() -> dict[str, list[tuple[int, int]]] — contiguous body-only token runs grouped by message role.
content_mask_for_roles(roles) -> list[bool] — per-token bool mask, True only on body tokens whose message role is in the supplied set.
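A minimal sketch of how these two helpers could behave, using a hypothetical `ToyRenderedTokens` stand-in rather than the real class. The field layout (one `message_indices` entry per token, indexing into `message_roles`), the half-open `(start, end)` span convention, and the empty return on mismatched lengths are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRenderedTokens:
    """Hypothetical stand-in; the real RenderedTokens carries more fields."""
    token_ids: list[int]
    message_indices: list[int]   # assumed: per-token index into message_roles
    message_roles: list[str]     # assumed: one role per message
    is_content: list[bool] = field(default_factory=list)

    def _consistent(self) -> bool:
        n = len(self.token_ids)
        return len(self.is_content) == n and len(self.message_indices) == n

    def content_mask_for_roles(self, roles: set[str]) -> list[bool]:
        if not self._consistent():
            return []  # assumed behaviour when the renderer opted out
        return [
            c and self.message_roles[m] in roles
            for c, m in zip(self.is_content, self.message_indices)
        ]

    def content_token_spans_by_role(self) -> dict[str, list[tuple[int, int]]]:
        spans: dict[str, list[tuple[int, int]]] = {}
        if not self._consistent():
            return spans
        start = None
        n = len(self.token_ids)
        for k in range(n + 1):
            at_end = k == n
            continues = (
                not at_end
                and start is not None
                and self.is_content[k]
                and self.message_indices[k] == self.message_indices[start]
            )
            if start is not None and not continues:
                role = self.message_roles[self.message_indices[start]]
                spans.setdefault(role, []).append((start, k))  # half-open span
                start = None
            if not at_end and start is None and self.is_content[k]:
                start = k
        return spans


toy = ToyRenderedTokens(
    token_ids=[10, 11, 12, 13, 14, 15],
    message_indices=[0, 0, 0, 1, 1, 1],
    message_roles=["tool", "assistant"],
    is_content=[False, True, True, False, True, True],
)
tool_mask = toy.content_mask_for_roles({"tool"})
spans = toy.content_token_spans_by_role()
```

In this toy case the tool body occupies tokens 1 and 2, so the mask for `{"tool"}` is True only there and the assistant body appears as a separate span.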

Module-level in renderers.base:

attribute_text_segments(tokenizer, segments) — single-BPE-pass attribution via offset_mapping. When the supplied tokenizer doesn't track offsets (fastokens patch), lazy-loads a vanilla offset-capable tokenizer for the same model and caches it process-globally.
build_training_sample(..., content_sft_roles=...) — opt-in body-only supervision for roles the model never samples. Falls back to the role_to_mask + sampled_mask behaviour when is_content is empty.
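The mask combination that `content_sft_roles` describes can be sketched roughly as below. `sketch_loss_mask` is a hypothetical helper, not the signature of `build_training_sample`; it ignores `role_to_mask` and assumes a per-token role list is already available.

```python
def sketch_loss_mask(sampled_mask, is_content, token_roles, content_sft_roles=frozenset()):
    """Hypothetical simplification: train on sampled tokens, plus body-only
    tokens of the opted-in roles. Falls back to sampled_mask alone when
    is_content is empty (renderer opted out) or no roles are requested."""
    if not is_content or not content_sft_roles:
        return list(sampled_mask)
    return [
        s or (c and r in content_sft_roles)
        for s, c, r in zip(sampled_mask, is_content, token_roles)
    ]


# Toy per-token data: user wrap+body, tool wrap+body+wrap, assistant body.
token_roles = ["user", "user", "tool", "tool", "tool", "assistant", "assistant"]
sampled_mask = [False, False, False, False, False, True, True]
is_content = [False, True, False, True, False, True, True]

loss_mask = sketch_loss_mask(sampled_mask, is_content, token_roles, {"tool"})
fallback = sketch_loss_mask(sampled_mask, [], token_roles, {"tool"})
```

The user body is content but stays out of the loss because `"user"` is not in the requested set; the tool wrap tokens stay out because they are scaffold.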

On renderers.client.generate():

Result field prompt_attribution: RenderedTokens | None — the per-token attribution for the prompt, either the one this call computes via render() internally or the one the caller threaded in alongside prompt_ids. Downstream consumers call attr.content_mask_for_roles({"tool"}) on it to build selective loss masks without re-rendering.
Parameter prompt_attribution: RenderedTokens | None = None — callers that pre-build prompt_ids (the multi-turn bridge path in verifiers) hand in the RenderedTokens that bridge_to_next_turn returned, and it surfaces on the result unchanged.

The new field on generate() mirrors the existing multi_modal_data sidecar — same shape, same None-default-when-unknown semantics.
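The three cases for `prompt_attribution` can be sketched with a toy function. `sketch_generate` and `toy_render` are hypothetical, plain dicts stand in for `RenderedTokens`, and no sampling happens; only the passthrough logic described above is modelled.

```python
def sketch_generate(render, messages=None, prompt_ids=None, prompt_attribution=None):
    """Hypothetical passthrough logic only; no model call is made."""
    if prompt_ids is None:
        # Render path: the attribution is whatever render() produced.
        rendered = render(messages)
        prompt_ids = rendered["token_ids"]
        prompt_attribution = rendered
    # Pre-built path: surface what the caller handed in (possibly None).
    return {"prompt_ids": prompt_ids, "prompt_attribution": prompt_attribution}


def toy_render(messages):
    # Stand-in for renderer.render(); values are made up.
    return {"token_ids": [1, 2, 3], "is_content": [False, True, False]}


rendered_case = sketch_generate(toy_render, messages=[{"role": "user", "content": "hi"}])

bridged = {"token_ids": [1, 2, 3, 4], "is_content": [False, False, False, True]}
bridge_case = sketch_generate(
    toy_render, prompt_ids=bridged["token_ids"], prompt_attribution=bridged
)

gap_case = sketch_generate(toy_render, prompt_ids=[1, 2, 3])
```

A `None` in the last case lets a caller detect that attribution is missing instead of silently receiving an all-False mask.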

How it works

Every renderer has emit sites like emit_text("user\n" + content, ...) that join wrap text and body text into one BPE pass to preserve token merges at the boundary. The emit_text_segments(...) helper (defined locally in each renderer) does the same join with per-token attribution:

Concatenate segment texts and run a single BPE encode.
Use the fast tokenizer's offset_mapping to recover each token's character span.
Attribute each token to whichever segment contains its first source character.
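The three steps above can be sketched as follows. `attribute_by_offsets` is a hypothetical helper that takes precomputed character offsets, and the offsets in the example are hand-written rather than produced by a tokenizer. With a Hugging Face fast tokenizer they would typically come from calling the tokenizer with `return_offsets_mapping=True`.

```python
from bisect import bisect_right


def attribute_by_offsets(segments, offsets):
    """segments: list of (text, is_content) pairs, concatenated in order.
    offsets: one (start, end) character span per token over the joined text.
    Returns one bool per token, taken from the segment that contains the
    token's first source character."""
    starts, pos = [], 0
    for text, _ in segments:
        starts.append(pos)
        pos += len(text)
    flags = [flag for _, flag in segments]
    return [flags[bisect_right(starts, start) - 1] for start, _end in offsets]


segments = [("user\n", False), ("hello world", True), ("\n", False)]
joined = "".join(text for text, _ in segments)

# Hand-written offsets for tokens "user", "\n", "hello", " world", "\n".
clean_offsets = [(0, 4), (4, 5), (5, 10), (10, 16), (16, 17)]
clean_mask = attribute_by_offsets(segments, clean_offsets)

# Hypothetical merge across the boundary: one token covers "\nhello".
# Its first character lies in the wrap segment, so it counts as scaffold.
merged_offsets = [(0, 4), (4, 10), (10, 16), (16, 17)]
merged_mask = attribute_by_offsets(segments, merged_offsets)
```

The second case shows the consequence of the first-character rule: a token that merges wrap and body characters is attributed to whichever segment it starts in.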

fastokens (the Rust BPE patched in by default for ~10x faster encode) doesn't track offsets. attribute_text_segments transparently loads a vanilla offset-capable tokenizer for the same model and caches it process-globally per model name. Most models in MODEL_RENDERER_MAP produce byte-identical token IDs between fastokens and vanilla, so the mix is safe; models in FASTOKENS_INCOMPATIBLE already use vanilla everywhere.
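A process-global, per-model cache of this kind might look like the sketch below. The names `_OFFSET_TOKENIZERS` and `get_offset_tokenizer` are hypothetical, and the loader is passed in as a parameter so the example runs without downloading a model; in practice it would be whatever call loads a vanilla tokenizer.

```python
import threading

# Hypothetical module-level cache: one entry per model name, shared by
# every renderer instance in the process.
_OFFSET_TOKENIZERS: dict[str, object] = {}
_CACHE_LOCK = threading.Lock()


def get_offset_tokenizer(model_name, loader):
    """Return the cached offset-capable tokenizer, loading it at most once."""
    with _CACHE_LOCK:
        tokenizer = _OFFSET_TOKENIZERS.get(model_name)
        if tokenizer is None:
            tokenizer = loader(model_name)
            _OFFSET_TOKENIZERS[model_name] = tokenizer
        return tokenizer


load_calls = []


def fake_loader(model_name):
    # Stand-in for an expensive tokenizer load.
    load_calls.append(model_name)
    return object()


# Simulate a 32-slot pool that all asks for the same model.
pool = [get_offset_tokenizer("toy-model", fake_loader) for _ in range(32)]
```

Keying on the model name rather than on the pool or renderer instance is what keeps the extra memory at one tokenizer per model.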

A few renderers use tokenizers that can't provide offset mapping at all and rely on per-renderer alternatives:

Kimi K2 family uses TikTokenTokenizer. Avoids concatenated wrap+body emits to begin with — Kimi's structure splits wrap and body at special-token boundaries, so threading is_content through the split emits suffices.
gpt-oss (Harmony) has an opaque prefix block. Diffs the rendered prefix against an empty-instructions render to recover the developer-instructions body span inside it.
MiniMax M2 has a BPE-merge between <response> and the body's first letter under certain tokenizer load orders. A local emit_token_overlap_body helper picks the overlap rule so the body's leading byte stays recoverable from its body run.
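The diff-based recovery described for gpt-oss can be illustrated with a generic common-prefix/common-suffix comparison of two token lists. `recover_inserted_span` is a hypothetical helper and the token IDs are made up; the real renderer may need extra handling, for example if the empty-instructions render also drops surrounding wrap tokens.

```python
def recover_inserted_span(with_body, without_body):
    """Return the half-open [start, end) range of tokens in `with_body`
    that is not explained by the prefix and suffix it shares with
    `without_body`."""
    limit = min(len(with_body), len(without_body))
    prefix = 0
    while prefix < limit and with_body[prefix] == without_body[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and with_body[-1 - suffix] == without_body[-1 - suffix]:
        suffix += 1
    return prefix, len(with_body) - suffix


# Made-up token IDs: 50 and 51 stand for the instructions body.
empty_render = [1, 2, 3, 9]
full_render = [1, 2, 3, 50, 51, 9]

start, end = recover_inserted_span(full_render, empty_render)
body_tokens = full_render[start:end]
```

Bounding the suffix scan by `limit - prefix` keeps the prefix and suffix from overlapping when the body repeats tokens found in the surrounding scaffold.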

Per-renderer coverage

| Renderer | Notes |
| --- | --- |
| `qwen3` | Reference implementation. |
| `qwen3.5` | XML-style tool calls. Auto-detected `enable_thinking` polarity preserved. |
| `qwen3.6` | Inherits from `Qwen35Renderer`; only overrides a pure string serializer, so it picks up `is_content` through the parent class. |
| `qwen3-vl` | `<\|image_pad\|>` placeholders are body (`is_content=True`); the surrounding `<\|vision_start\|>` / `<\|vision_end\|>` are scaffold. |
| `glm5` / `glm5.1` | `GLM5Renderer` covers both via subclass. Also covers `zai-org/GLM-4.7-Flash`. |
| `glm4.5` | `<\|observation\|>` / `<tool_response>` wraps are scaffold; body is content. |
| `kimi-k2` | `TikTokenTokenizer`: uses existing split-emit boundaries (no `attribute_text_segments`). |
| `kimi-k2.5` / `kimi-k2.6` | Multimodal: `<\|media_pad\|>` is body; `<\|media_begin\|>...<\|media_end\|>` wrap is scaffold. |
| `minimax-m2` | XML-style tool calls. `FASTOKENS_INCOMPATIBLE` (vanilla everywhere). Local overlap helper for `<response>` BPE merge. |
| `deepseek-v3` | `FASTOKENS_INCOMPATIBLE` (Metaspace pretokenizer). Standard wrap/body split. |
| `nemotron-3` | Tool body uses `emit_text_segments` for `\n` boundaries. |
| `laguna-xs.2` | Default-system header text is scaffold; caller-supplied system content is body. |
| `gpt-oss` | Harmony format. `functions.{name}` text on tool result messages is scaffold (comes from prior assistant `tool_calls`, not this tool's content). |

DefaultRenderer leaves is_content empty.

Tests

tests/test_is_content.py: 10 invariants × 17-model matrix, including:

Length matches token_ids or is empty (opt-out).
is_content == sampled_mask over assistant tokens.
Generation-prompt tokens are is_content=False.
User / tool / system bodies are recoverable from the decoded is_content=True run.
First role-tag token is is_content=False.
content_token_spans_by_role() isolates tool body cleanly.
content_mask_for_roles({"tool"}) excludes assistant.
build_training_sample(..., content_sft_roles={"tool"}) trains tool body + assistant, never user.

tests/test_client.py covers the prompt_attribution surface on generate():

The parse-and-build test asserts prompt_attribution carries every populated RenderedTokens field through verbatim.
The bridge shape (caller passes both prompt_ids and prompt_attribution) passes attribution through unchanged.
The pre-built-prompt-without-attribution path returns None so callers can detect the gap.

Full suite collects 1557 tests, all of which pass (modulo pre-existing gpt-oss HF-parity skips and one unrelated xfail). test_render_ids byte-identity vs apply_chat_template is green on every renderer.

Additional fixes

nemotron3: message_roles was sourced from the auto-injected normalised list, off-by-one when a default system was prepended. Now indexes the caller-provided message list.
kimi_k2: same off-by-one fixed via a caller_messages snapshot.
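The off-by-one in both fixes can be reproduced with a toy example. It assumes, for illustration only, that per-token message indices refer to positions in the caller-provided list and that normalisation prepends a default system message; the variable names are hypothetical.

```python
# Caller-provided messages: index 0 is the user turn.
caller_messages = [
    {"role": "user", "content": "What is 6 * 7?"},
    {"role": "assistant", "content": "42"},
]

# Assumed normalisation step: a default system message is auto-injected.
normalised = [{"role": "system", "content": "default system prompt"}] + caller_messages

# Buggy sourcing: roles read from the normalised list are shifted by one
# relative to caller-side message indices.
buggy_roles = [m["role"] for m in normalised]

# Fixed sourcing: roles read from a snapshot of the caller's list.
fixed_roles = [m["role"] for m in caller_messages]

caller_index_of_user = 0
role_seen_by_buggy_lookup = buggy_roles[caller_index_of_user]
role_seen_by_fixed_lookup = fixed_roles[caller_index_of_user]
```

With the shifted list, a role-filtered mask such as one for `{"tool"}` would select the wrong message's tokens whenever a default system prompt is injected.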

Notes for the maintainer

bridge_to_next_turn populates is_content on the bridge-emitted portion only; the prior portion (previous_prompt_ids + previous_completion_ids) gets [False] * len(previous_ids) per the same convention sampled_mask follows on bridge output. Consumers walk the trajectory and read each step's own is_content for full-conversation body masks.
The vanilla tokenizer loaded for offset_mapping is cached process-globally per model name (not per pool), so a 32-slot pool of the same model adds exactly one extra tokenizer to memory, not 32.
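The bridge convention in the first note above can be sketched as follows. Both helpers are hypothetical, and the overlay assumes each step's token sequence is a prefix of the next step's. Positions that belong to sampled completions stay False in this prompt-side overlay; a trainer would presumably take those from its own completion mask.

```python
def bridge_is_content(previous_len, bridge_mask):
    """Prior portion is all False; only bridge-emitted tokens carry flags."""
    return [False] * previous_len + list(bridge_mask)


def overlay_step_masks(step_masks):
    """OR together per-step masks, assuming each step extends the last."""
    full = [False] * len(step_masks[-1])
    for mask in step_masks:
        for k, flag in enumerate(mask):
            full[k] = full[k] or flag
    return full


# Step 1: a 3-token prompt rendered from scratch, body at position 1.
step1 = [False, True, False]

# The model then samples a 2-token completion (not part of step1's mask).
completion_len = 2

# Step 2: the bridge appends 4 tokens, two of them tool body.
step2 = bridge_is_content(len(step1) + completion_len, [False, True, True, False])

full_mask = overlay_step_masks([step1, step2])
```

Reading only the last step's mask would lose the body flag at position 1, which is why each step's own `is_content` has to be consulted.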

Note

Add per-token `is_content` body/scaffold attribution mask to all renderers

Adds an is_content: list[bool] field to RenderedTokens in renderers/base.py that marks each token as caller/model body (True) or template scaffolding (False).
Introduces attribute_text_segments in renderers/base.py to tokenize concatenated (text, is_content) segments in a single BPE pass using offset mapping, preserving merge boundaries while attributing each token to the correct segment.
Implements is_content population across all renderers (qwen3, qwen35, qwen3_vl, deepseek_v3, gpt_oss, kimi_k2, kimi_k25, laguna_xs2, minimax_m2, nemotron3, glm45, glm5), including render, bridge/bridge_to_next_turn, and assistant/tool helpers.
Extends build_training_sample with a content_sft_roles parameter that restricts loss to body-only tokens for specified roles using is_content, leaving behavior unchanged when the field is absent or empty.
Adds content_token_spans_by_role and content_mask_for_roles helpers to RenderedTokens for downstream span extraction.
Behavioral Change: assistant tokens enforce is_content == sampled_mask; message_roles in some renderers now reflects the original caller message list rather than the post-normalized list.

Macroscope summarized 16c04a9.

Note

Medium Risk
Touches core token attribution and training mask construction across many model-specific renderers; mistakes could silently change loss masking or prompt formatting. Mitigated by keeping token IDs byte-identical and by adding plumbing/tests, but the breadth of renderer changes raises regression risk.

Overview
Adds a new per-token RenderedTokens.is_content signal to distinguish message body bytes from renderer-injected scaffold across all roles, plus helpers to extract body-only spans/masks.

Extends build_training_sample with content_sft_roles to optionally supervise body-only tokens for non-sampled roles (e.g. tool responses) while keeping scaffold tokens masked. Introduces attribute_text_segments with an offset-aware tokenizer cache to attribute tokens back to (text, is_content) segments without breaking BPE merges.

Plumbs attribution through bridge_to_next_turn and client.generate() via a new prompt_attribution parameter/return field, and updates all hand-coded renderers to populate is_content (including special handling for opaque/prefixed formats like gpt_oss and boundary-merge edge cases like minimax_m2).

Reviewed by Cursor Bugbot for commit 16c04a9. Bugbot is set up for automated code reviews on this repo.

Generalizes sampled_mask across all roles. is_content[k] is True iff token k came from message-body bytes (caller-provided content / tool_calls / reasoning_content, or the model's sampled emission for assistant) and False iff template scaffolding (role tags, closers when not sampled, inter-turn separators, tool-response wraps, tools-header block, generation prompt). By construction is_content == sampled_mask over every assistant-attributed token; it carries new information on every other role, where sampled_mask is uniformly False.

Enables SFT on tool response bodies while applying RL only to assistant tokens: build_training_sample(..., content_sft_roles={"tool"}) trains the model to anticipate tool outputs without learning to emit the surrounding <|tool_response>/role-tag scaffold (which would interrupt a real rollout).

New on RenderedTokens:

- is_content: list[bool] field (empty when the renderer opts out, same policy as sampled_mask)
- content_token_spans_by_role()
- content_mask_for_roles(roles)

New module-level helpers in base.py:

- attribute_text_segments(tokenizer, segments): single-BPE-pass attribution via offset_mapping; auto-loads a vanilla offset-capable tokenizer when the supplied one doesn't track offsets (fastokens patch), cached process-globally per model name.
- build_training_sample(..., content_sft_roles=...): opt-in body-only supervision for roles the model never samples. Falls back to the prior role_to_mask + sampled_mask behaviour when is_content is empty.

Wired through every hand-coded renderer: qwen3, qwen3.5, qwen3.6 (inherits), qwen3-vl, glm5, glm5.1, glm4.5, kimi-k2, kimi-k2.5/2.6, minimax-m2, deepseek-v3, nemotron-3, laguna-xs.2, gpt-oss. Concatenated wrap+body emits go through emit_text_segments (or per-renderer equivalents) so BPE merges at the boundary stay byte-identical with the prior single-emit path.

Renderers whose tokenizer doesn't support offset_mapping (Kimi, MiniMax with its known fastokens edge case) use boundary-aware emit patterns or a per-renderer overlap rule to keep body bytes recoverable.

Multimodal placeholders (<|image_pad|>, <|media_pad|>) are body (is_content=True); they represent caller-provided image data in token form. The surrounding vision/media wrap specials are scaffold.

Fixed along the way:

- nemotron3: off-by-one in message_roles when a default system was auto-injected. Now indexes the caller-provided message list.
- kimi_k2: same off-by-one fixed via a caller_messages snapshot.

Tests: 10 invariants × 17-model matrix in tests/test_is_content.py. Token IDs stay byte-identical vs apply_chat_template across every renderer; existing test_render_ids / test_sampled_mask / test_bridge / test_build_helpers / test_tokens_per_message / test_roundtrip / test_multimodal stay green.

DefaultRenderer leaves is_content empty (Jinja is opaque), same policy as sampled_mask.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Output of `uv run ruff format` after wiring is_content through the renderers in the previous commit. No semantic changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

macroscopeapp · 2026-05-19T12:12:05Z

Approvability

Verdict: Needs human review

New feature adding per-token body/scaffold attribution (is_content mask) across 15+ renderer files, with changes to build_training_sample that affect how training loss masks are computed. The scope, complexity, and training-behavior impact warrant human review.

No code changes detected at 16c04a9. Prior analysis still applies.


snimu · 2026-05-19T12:16:11Z

@macroscope-app

macroscopeapp · 2026-05-19T12:16:46Z

Hi! How can I help? I can:

Run a full code review on this PR
Answer questions about the code or architecture
Help fix issues (if you identify any)

Just let me know what you need!

snimu · 2026-05-19T12:17:12Z

please run a correctness check

snimu · 2026-05-19T12:18:42Z

@macroscope-app please run a correctness check

macroscopeapp · 2026-05-19T12:19:01Z

Correctness check is running. Results will be posted when complete.

Previously completed check runs:

Macroscope - Approvability Check

…y_role

content_token_spans_by_role only checked that is_content matched token_ids length, but it then walks spans returned by message_token_spans(), whose values are indices into message_indices. When len(message_indices) > len(token_ids), the span end can exceed len(is_content), and the inner loop dereferences self.is_content[k] past the end, raising IndexError.

The sister method content_mask_for_roles already checks both lengths (introduced in the same commit). Match the precedent.

Not reachable through the renderer pipeline today, since every hand-coded renderer populates the four parallel lists in lock-step, but the dataclass doesn't enforce the invariant, so a manually-constructed RenderedTokens (e.g. a test fixture) could trip it. The guard is cheap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
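The failure mode and the guard described in this commit message can be reproduced in isolation. `body_tokens_in_spans` is a hypothetical helper, and returning an empty list on mismatched lengths is an assumption about how the guard reports the problem.

```python
def body_tokens_in_spans(token_ids, message_indices, is_content, spans):
    """Collect body token positions inside the given (start, end) spans.
    Guard: all three parallel lists must agree in length before any span
    derived from message_indices is used to index is_content."""
    if not (len(token_ids) == len(message_indices) == len(is_content)):
        return []  # assumed: treat an inconsistent object like an opt-out
    return [k for start, end in spans for k in range(start, end) if is_content[k]]


# Manually constructed, inconsistent fixture: message_indices is longer.
token_ids = [1, 2, 3]
is_content = [False, True, False]
message_indices = [0, 0, 0, 0, 0]
spans = [(0, len(message_indices))]   # span end exceeds len(is_content)

guarded = body_tokens_in_spans(token_ids, message_indices, is_content, spans)

# Without the guard, walking the same span indexes past the end.
try:
    [is_content[k] for start, end in spans for k in range(start, end)]
    unguarded_raised = False
except IndexError:
    unguarded_raised = True

consistent = body_tokens_in_spans(token_ids, [0, 0, 0], is_content, [(0, 3)])
```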

generate() already calls renderer.render() internally when the caller doesn't pre-supply prompt_ids, producing a RenderedTokens that carries token_ids, message_indices, sampled_mask, is_content, message_roles, and multi_modal_data. Previously we surfaced only token_ids and multi_modal_data and dropped the rest at the function boundary. Callers that wanted per-token attribution downstream (verifiers' RendererClient → prime-rl, for SFT-on-tool-body / selective loss masking) had no way to recover it without a second render pass.

Two surfaces:

- New return field ``prompt_attribution``: the full RenderedTokens for the prompt, either the one this call computed via render() or the one the caller threaded in alongside prompt_ids. Downstream consumers call e.g. ``content_mask_for_roles({"tool"})`` on it to build the SFT-on-tool-body mask without re-rendering.
- New optional parameter ``prompt_attribution``: callers that pre-built prompt_ids (the multi-turn bridge path in verifiers) can hand in the RenderedTokens that bridge_to_next_turn returned, and it surfaces on the result unchanged.

Mirrors the existing multi_modal_data plumbing: same shape, same None-default-when-unknown semantics. No behavioural change for callers that don't read the new field.

Tests:

- Existing parse-and-build test now asserts prompt_attribution carries every populated RenderedTokens field through verbatim.
- New test for the pre-built-prompt-with-attribution path (the bridge shape verifiers will use).
- New assertion that pre-built-prompt-without-attribution returns ``None`` so callers can detect the gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

snimu · 2026-05-19T13:36:03Z

@macroscope-app please run a correctness check again

macroscopeapp · 2026-05-19T13:36:23Z

Correctness check complete. No issues found.


hallerite

Lgtm overall. One concern I have is with the multimodal models & their image pad tokens – how is this handled when they are expanded vs. un-expanded?

snimu and others added 2 commits May 19, 2026 13:36

style: apply ruff format to is_content renderer/test sources

42d1ffd


snimu requested a review from hallerite May 19, 2026 12:11

macroscopeapp Bot reviewed May 19, 2026


Comment thread renderers/base.py Outdated

snimu and others added 2 commits May 19, 2026 14:25

Merge remote-tracking branch 'origin/main' into sebastian/content-mas…

16c04a9

…k-2026-05-19

hallerite force-pushed the sebastian/content-mask-2026-05-19 branch from 02c3e85 to 16c04a9 Compare May 20, 2026 13:25

hallerite reviewed May 20, 2026


hallerite approved these changes May 20, 2026


hallerite merged commit 1691f87 into main May 20, 2026
11 checks passed

hallerite deleted the sebastian/content-mask-2026-05-19 branch May 20, 2026 14:42

hallerite merged 5 commits into main from sebastian/content-mask-2026-05-19
